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Abstract 

Background: The secondary structure that maximizes the number of non-crossing matchings between 
complimentary bases of an RNA sequence of length n can be computed in O(n^) time using Nussinov's dynamic 
programming algorithm. The Four-Russians method is a technique that reduces the running time for certain dynamic 
programming algorithms by a multiplicative factor after a preprocessing step where solutions to all smaller 
subproblems of a fixed size are exhaustively enumerated and solved. Frid and Gusfield designed an O(j^) algorithm 
for RNA folding using the Four-Russians technique. In their algorithm the preprocessing is interleaved with the 
algorithm computation. 

Theoretical results: We simplify the algorithm and the analysis by doing the preprocessing once prior to the 
algorithm computation. We call this the two-vector method. We also show variants where instead of exhaustive 
preprocessing, we only solve the subproblems encountered in the main algorithm once and memoize the results. We 
give a simple proof of correctness and explore the practical advantages over the earlier method. 
The Nussinov algorithm admits an O(n^) time parallel algorithm. We show a parallel algorithm using the two-vector 
idea that improves the time bound to O(j^). 

Practical results: We have implemented the parallel algorithm on graphics processing units using the CUDA 
I platform. We discuss the organization of the data structures to exploit coalesced memory access for fast running 
times. The ideas to organize the data structures also help in improving the running time of the serial algorithms. For 
sequences of length up to 6000 bases the parallel algorithm takes only about 2.5 seconds and the two-vector serial 
I method takes about 57 seconds on a desktop and 1 5 seconds on a server. Among the serial algorithms, the 
two-vector and memoized versions are faster than the Frid-Gusfield algorithm by a factor of 3, and are faster than 
Nussinov by up to a factor of 20. The source-code for the algorithms is available at http : //github . com/ 
i j alabv/FourRussiansRNAFolding. 

Keywords: RNA-folding, Four-Russians, CUDA, Algorithms, Parallel algorithm, GPU 



Background 

Computational approaches to find the secondary struc- 
ture of RNA molecules are used extensively in bioinfor- 
matics applications. 

The classic dynamic programming (DP) algorithm pro- 
posed in the 1970s has been central to most structure 
prediction algorithms. While the objective of the original 
algorithm was to maximize the number of non-crossing 
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pairings between complementary bases, the dynamic pro- 
gramming approach has been used for other models and 
approaches, including minimizing the free energy of a 
structure. The DP algorithm runs in cubic time and there 
have been many attempts at improving its running time 
[1,2]. Here, we use the Four-Russians method for speeding 
up the computation. 

The Four-Russians method, named after Aralazarov 
et al. [3], is a method to speed up certain dynamic 
programming algorithms. In a typical Four-Russians algo- 
rithm there is a preprocessing step that exhaustively enu- 
merates and solves a set of subproblems and the results 
are tabled. In the main DP algorithm, instead of filling out 
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or inspecting individual cells, the algorithm takes longer 
strides in the table. The computation for multiple cells is 
solved in constant time by utilizing the preprocessed solu- 
tions to the subproblems. The longer strides to fill the 
table reduce the runtime by a multiplicative factor. The 
size of the subproblems is chosen in a way that does not 
make the preprocessing too expensive. 

Frid and Gusfield [4] showed the application of the Four- 
Russians approach for RNA folding. In their algorithm, 
the preprocessing is interleaved with the algorithm com- 
putation. They fill out a part of the DP table and use 
these entries to complete a part of the preprocessing. The 
preprocessed entries are used later in the computation. 

We show a simpler algorithm, where, all the preprocess- 
ing is completed before the start of the main algorithm. 
This simplifies the correctness proof and the runtime 
analysis. This approach helps in obtaining an Oiy? j logfz) 
time parallel algorithm, which is a log n factor improve- 
ment over previously-known parallel algorithms. Zakov 
and Frid (personal communication) had also observed 
that the algorithm in [4] could be modified to do the 
preprocessing once at the start of the algorithm. It is 
essentially the idea described here. 

In this paper we explore the implications of the one- 
pass preprocessing idea. This description of the algorithm 
leads naturally to two other variants. We empirically eval- 
uate these variants and also the implementation of the 
parallel algorithm. 

The parallel architecture of general-purpose graphical 
processing units (GPUs) have been exploited for many 
real-world application in addition to applications in gam- 
ing and visualization problems. 

GPUs have also been used to speed up RNA folding 
algorithms [5-7]. Here we show how the Four- Russians 
method allows an organization of the data structures for 
fast memory accesses. We also describe the organization 
of the parallel hierarchy to exploit the inherent parallelism 
of the solution. 

In the rest of the section, we describe the problem in 
relation to the other problems in RNA folding. To keep 
the paper self-contained, we will first describe the two- 
vector algorithm, our application of the Four-Russians 
method to the RNA folding problem. We will use that 
description to describe the original Four- Russians method 
for RNA folding by Frid and Gusfield [4]. This discus- 
sion leads to two other variants where the preprocessing 
is done on demand, instead of the exhaustive prepro- 
cessing in the two-vector method and the Frid-Gusfield 

2 

algorithm. In Section 'An 0(j^^) parallel algorithm' we 

discuss the 0(n^/logn) parallel algorithm. We will then 
describe the implementation of a parallel algorithm using 
CUDA. The final sections have discussion on empirical 
observations and conclusions. 



Related work 

The O(w^) dynamic programming algorithm due to 
Nussinov et al. [8,9] maximizes the number of non- 
crossing matching complimentary bases. There have been 
many methods since Zuker and Stiegler [10] that infer the 
folding using thermodynamic parameters [11,12] which 
are more realistic than maximizing the number of base 
pairs. These methods have been implemented in many 
packages including UNAFold [13], Mfold [14], Vienna 
RNA Package [15], RNAstructure [16]. 

ProbabiUstic methods include stochastic context-free 
grammars [17,18], the maximum expected accuracy 
(MEA) method, where secondary structures are com- 
posed of pairs that have a maximal sum of pairing prob- 
abilities, e.g., MaxExpect [19], Pfold [20], CONTRAfold 
[21] which maximize the posterior probabilities of base 
pairs; and Sfold [22], CentroidFold [23] that maximize the 
centroid estimator. There are also other methods that use 
a combination of thermodynamic and statistical parame- 
ters [24] and methods that use training sets of known folds 
to determine their parameters, e.g., CONTRAfold [21], 
Simfold [25] and ContextFold [26]. 

In addition to the Four- Russians method, other methods 
to improve the running time include Valiants max-plus 
matrix multiplication by Akutsu [1] and Zakov et al. [2]; 
and sparsification, where the branch points are pruned to 
get an improved time bound [27,28]. 

CUDA, the programming platform for GPGPUs, has 
been used to solve many bioinformatics problems. Chang, 
Kimmer and Ouyang [5] and Stojanovski, Gjorgjevikj and 
Madjarov [7] show an implementation of the Nussinov 
algorithm on CUDA. Rizk et al. [6] describe the imple- 
mentation for Zuker and Stiegler method involving energy 
parameters. These methods are discussed later in the 
parallel implementation section. 

The Nussinov algorithm 

In this paper, we consider the basic RNA folding problem 
of maximizing the number of non-crossing complimen- 
tary base pair matchings. Complimentary bases can be 
paired, i.e., A with U and C with G. A set of disjoint pairs is 
a matching. The pairs in a matching must not cross, i.e., if 
bases in positions / and 7 are paired and if bases k and / are 
paired, then either they are nested, i.e., i < k < I < j ov 
they are non-intersecting, i.e., i < J < k < 1. The objec- 
tive is to maximize the number of pairings under these 
constraints. 

The following algorithm, due to Nussinov [8] maxi- 
mizes the number of non-crossing matchings. For an 
input sequence S of length n over the alphabet A , C , G , 
U, the recurrence is defined as follows. Let D{Uj) denote 
the optimal cost of folding for the subsequence from / to 7. 
For all /, D{U i — 1) = D{U 0 = 0 and for all / < /: 



Venkatachalam etal. Algorithms for Molecular Biology 2014, 9:5 
http://www.almob.0rg/content/9/l/5 



Page 3 of 12 



' [ max/+i<^<y D(i, k-l)-\- D(k,j) 



where b(,, .) = 1 for complimentary bases and 0 other- 
wise. The DP table is the upper triangular part of the nxn 
matrix. The optimal solution is given by D(l, n). The table 
can be filled column-wise from the first column till the 
n^^. There are other ways of filling the table too, e.g., along 
the diagonals — the (/, 0 -diagonal first, (/, / + 1) -diagonal 
next and so on, until the last diagonal with one entry, 
D(l,n). To allow for traceback we need to store the bases 
that are paired to get the maximum value. Let /)*(/,;) 
denote the corresponding indices. These are obtained 
by substituting argmax in place of max in the above 
recurrence and can be computed along with the max 
value. 

The first part of the recurrence can be solved in constant 
time. The second part is more expensive, incurring B (n) 
look ups and maximum computations. There are 0(n^) 
entries in the DP table and each cell can be computed in 
0(n) time, giving an 0(n^) time algorithm. 

The Four-Russians algorithms 

In this section we discuss three variants of the Four- 
Russians algorithm. We will first describe the two-vector 
approach. Since it is simpler than the other methods we 
will use the description to discuss two other variants. 

Two-vector algorithm 

To apply the Four-Russians technique we start with the 
following observation: 

Lemma The values along a column from bottom to 
top and along a row from left to right are monotonically 
non-decreasing Consecutive cells differ at most by 1. 

Proof. Consider the values of neighboring cells (/,/) 
and (/ + 1,/). D(iJ) represents the solution of a longer 
sequence than D(i + 1,;). Therefore the former value 
should be at least as large as the latter. Suppose D(iJ) dif- 
fered from D(i + 1,/) by more than one. Then we can 
remove any matching for /. This has at most one fewer 
base pair matching and is a valid solution for the subse- 
quence (/ +1,;) with a larger value than its current value, 
contradicting the optimality of D(i + 1,;). An analogous 
argument holds along the rows. □ 

Once the cells D(i, I), D(i, / + 1), . . . , D(i, I -\- q - 1) dive 
computed, for some / e {/,... — q}, they can be rep- 
resented by D(i, I) + Vo, D(i, /) + V^i, . . . , D(i, I) + 



where Vp = D(iJ-\-p) - D(i, /), for p e {0, . . . ,q - 1}. Let 
us define, Vo = Oandv^ = Vp — Vp-i,iorp G {1, . . .,q—l}. 
From lemma 3.1, Vp e {0, 1}, for all p G [0,^ — 1]. Let 
V denote the binary vector vq, vi, . . . , Vq-i of differences 
and let V denote the vector of running totals Vo, Vi, . . . , 

Since the VpS are defined from s, the inverse func- 
tion is well defined: Vp = jy^=o ^k- Thus /) together 
with the vector v represents q consecutive cells of the 
table. 

Similarly, since the values are non-increasing down a 
column, D{i + / + 1,/), . . .,£>(/ + / + q^j) be represented 
by the pair£)(/ + / + 1,/), v, where v G {0, —1}^. The corre- 
sponding vector of cumulative sums is denoted V, We call 
y the horizontal difference vector or the horizontal vector 
and we call v the vertical difference vector or the column 
vector. 

Consider q consecutive cells from / + 1 to / + ^ used in 
computing D(iJ): 



D(iJ) ^ max D(i, /c - 1) + D(kJ) 

l-\-l<k<l-hq 



(2) 



max Dii, /) + + Dii + / + 1,;) + Vj, 

0<k<q-l 

D{U I) + D{i + / + 1,/) + max I4 + Vk (3) 

Q<k<q-l 



As before, we use argmax in place of max to obtain 
D''{i,j), which facilitates the traceback. 

As noted above the second line of the recurrence (1), 
looping over elements, is more expensive part of the com- 
putation and we will use (3) instead of (2) to compute the 
D and D* values in the Four-Russians method. That is, 
we will use (3) over groups of q cells each instead of one 
loop of (1). Since the V vectors are in bijection with the v 
vectors, we will use v in the computation. Let v and v be 
the corresponding vectors in (3). The following algorithm 
evaluates the max computation. 



Input: horizontal difference vector v and vertical differ- 
ence vector V 

1: max-val ^ 0 and max-index ^ 0 

2: sumi ^ 0 and sum2 ^ 0 

3: for /: = 0 to ^ — 1 do 

4: sumi ^ sumi + v/ 

5: sum2 ^ sum2 + Vi 

6: if sumi -\- sum2 > max-val then 

7: max-val ^ sumi + sum2 

8: max-index ^ k 

9: end if 
10: end for 

11: return {max-val ^ max-index) 
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Using this instead of (2) is not advantageous in itself. How- 
ever, if this algorithm is given as a black box, D{Uj) can 
be computed in constant time by invoking the black box 
once. To exploit this fact, we will preprocess this compu- 
tation over all possible v and v vectors and tabulate the 
results in R. Table R is indexed by a pair of numbers in 
the range [2^] to represent the two vectors (v, v). The table 
lookup is a constant time operation as it fetches the max 
and arg max values. We will show later that this exhaustive 
enumeration is not too expensive. 

In the Nussinov algorithm described in the previous 
section, the recurrence over q cells is evaluated using (2) 
and it takes 0{q) time. In the Four- Russians method, using 
the preprocessing step, the max computation is available 
through a table lookup and the recurrence for q terms can 
be completed in constant time. This reduction in the com- 
putation time is the reason for the speedup by a factor of q. 

The two-vector method modifies the Nussinov algo- 
rithm as follows. All the rows and columns of the table 
are grouped into groups of q cells each. The recurrence 
over these q cells is computed in constant time using the 
preprocessing table. The recurrence involves D{U /:—!) + 



D(kJ), i.e., the value in the (k — ly^ column is used 
with the k^^ row. Therefore the row and column group- 
ings differ by one. That is, the columns are grouped 
(0, 1, . . . , — 1), (q,q -\- 1, . . . ,2q — 1) etc. The rows are 
grouped (1, 2, ... , q), (^+1, ^+2, . . . , 2q) etc. This ensures 
that the row and column groups are well characterized. 
That is, to fill the cell (/,;), the k^^ group along row / needs 
to be combined with the k^^ group below (/,/) in column /. 

The cells of the table are filled in the same order as 
before. When the last cell of a row- or a column- group is 
evaluated the corresponding row and column vectors are 
computed and stored. To fill cell (/,/), we retrieve the first 
element and the horizontal vector of the group from row / 
and the first element and the column vector from the cor- 
responding group in column The recurrence is solved 
using (3) by a table lookup. The final value for D(iJ) is the 
maximum value over all the groups. There might be resid- 
ual elements in the row that do not fall in these groups. 
There are at most 2q such elements. These are solved 
separately using Nussinov s method. Algorithm 1 has the 
algorithm listing and Figure 1 describes the algorithm 
pictorially. 



Algorithm 1 Procedure for the two-vector Four- Russians speedup. The DP table is filled column-wise. 

1: R ^ preprocess all pairs of vectors of length q 

2: for / = 1 to ;^ do 

3: D(jJ) ^ 0 

4: + ^0 

5: for i =J — 1 down to 1 do 

6: D(iJ) ^ b(S[i] , S[J] ) + D(i + 1,; - 1) 

7: Let (/, /) be in the I^^ group in row /. 

8: Let (/,/) be in the J^^ group horizontally in the fi^ row and J^^^ group vertically in the column. 

9: Let // be the right-most entry of group / and jj be the left-most entry in group / 

10: for k = i-\-lto ij do // For all cells in the first group 

11: D(iJ) ^ max(D(/,;), D(i, k - 1) + D{Kj)) 

12: end for 

13: for k = jj to / do // For all cells in the last group 

14: D{hj) ^ max(D(/,;), D{U k - I) + D(kJ)) 

15: end for 

16: for /<r = 1 to / — / do // For all groups in between 

17: Let / be the left-most cell in the lO^ group to the right of / and t be the top-most cell in the K^^ group below 
/. 

18: Let Yi^K and v/q be the corresponding horizontal and vertical difference vectors. 

19: D(iJ) <- max(D(/,;), D(i, I) + D(tJ) + i?(v,-,/c, v/q)) 

20: end for 

21: if / mod q = 1 then // compute the vertical difference vector 

22: compute and store the v vector i/q^^ group for column / 

23: end if 

24: if / mod q = q — 1 then // compute the horizontal difference vector 

25: compute and store the v vector (7 — 1) group for row / 

26: end if 

27: end for 

28: end for 
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Runtime analysis 

In the precomputation phase, there are 2^ ^-length vec- 
tors and 2^^ pairs of vectors. The precomputation takes 
0(q) time per vector pair. Thus the total time for precom- 
putation is 0(q2^^). 

The main algorithm: There are O(n^) cells and to fill 
each cell it takes 0{n/q + q) time. That is, it takes 0(n/q) 
time to look up the initial value and the difference vec- 
tor and the R table lookups for the the 0(n/q) groups. It 
takes 0(q) time for the residual elements. Thus it takes 
0(n^ X (n/q -\- q)) time to fill the table. Every cell is invol- 
ved in at most two vector computations, where the dif- 
ference to its neighbor is computed once for the row and 
for the column vector. This takes an amortized 0(n^) time 
which is dominated by the rest of the algorithm. 

When q = log n, the total time for the entire algorithm is 
0(logn22l«g^ + + n2x(j^ +logn)) = 0(^2 log n + 

n^/logn) = 0(n^ / logn). 
FG algorithm 

Frid and Gusfield [4] first showed how the Four-Russians 
approach could be applied to the RNA-folding problem. 
We will call their algorithm the FG algorithm. FG and two- 
vector algorithms are variants of the same idea. We will 



highlight the differences in preprocessing and the maxi- 
mum value computation by the Four- Russians technique. 
In particular, we will show the maximum computation in 
step 19 of Algorithm 1. 

After computing the <5^-contiguous cells of a group in a 
row, the value in the initial cell D{Up) and the horizontal 
difference vector Vp are known. They run the preprocess- 
ing algorithm in page 3 for this fixed Vp vector together 
with all possible vertical difference vectors. They add the 
value of D{i,p) to the maximum and table the result. 
This preprocessing step is computed for every block of 
every row. The preprocessing table R is indexed by row 
number, group number and a vector (which is a poten- 
tial column vector). The horizontal vectors need not be 
stored. 

Notice the difference between the two-vector and the 
FG variants. In two- vector algorithm the preprocessing 
for vector pairs is computed once for each distinct pair 
of vectors. Whereas, in FG the preprocessing step is run 
once for each group of each row, even if the vector pair was 
seen earlier. This is because the table contains the result of 
addition of the initial cell of the group D{Up), 

To fill cell (/,;), they iterate over all groups and find the 
<5^-length column vectors. The preprocessed value for this 
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vector in the corresponding block is retrieved from the 
table and the result is added to D(q,J). 

The preprocessing is for horizontal vectors seen in the 
table. Since the horizontal vectors are not known before- 
hand, the precomputation cannot be done prior to the 
main algorithm. Instead, it is interleaved with the compu- 
tation of the table. They fill part of the DP table and use 
the vectors to complete some preprocessing, which in turn 
is used fill another part of the table and so on. 

Since the preprocessing is done for every group of every 
row, the same horizontal vector can be seen multiple times 
in the table. This leads to duplicated work and slower 
running time than the two-vector algorithm. 

3 

The running time for the FG method is Oi^^—^), where 
2 < b < n and M = n. 

Two variants that memoize 

The two-vector method computes the preprocessing over 
all possible vector pairs. While the computation is for 
done exactly once for the unique vector pairs, some of 
these vector pairs might not be seen in the table. In the FG 
method, the precomputation step is for only the horizon- 
tal vectors that are seen in the table. However, for some 
vectors, the computation is duplicated. Stated this way, a 
hybrid approach suggests itself. 

In our next variants, we memoize the results for a pair 
of vectors. Like the two-vector approach, the preprocess- 
ing is done only once for a vector pair and like the FG 
algorithm, it is only for the vectors seen in the table and 
the preprocessing is interleaved with the main algorithm. 
Since the preprocessing table is indexed by two vectors, 
unlike the FG algorithm, the results are computed only 
once for every vector seen. 

In the partially memoized version, upon completion of 
elements of a group, if a new horizontal vector is seen, we 
pair it with all possible 2^ column vectors and the results 
are tabled. In the completely memoized version, the result 
for a pair of horizontal and vertical vectors are computed 
the first time the pair is observed and the result is stored 
in the table. The computed values for retrieved from the 
table when they are seen again. The rest of the algorithm 
is identical to the two-vector method. 

All these variants take 0{ir' /\o%n) time but the memo- 
ized versions potentially store fewer vectors than the two 
vector method and will have a similar worst-case run- 
time in practice as the two-vector method. But, as argued 
before, the FG method does duplicated work and will be 
slower in practice. 

2 

An O(j^) parallel algorithm 

The Nussinov DP algorithm can be parallelized with 
n processes to get an 0{rP') parallel algorithm on a 



concurrent-read concurrent-write parallel random access 
memory (CRCW PRAM) machine . In the parallel algo- 
rithm, we fill the table diagonal by diagonal. We use n 
processes and assign one parallel process to each column. 
In the fi^ iteration, the p^^ process computes the value for 
the ip — diagonal entry. That is, in the first iteration, 
the bottom-most cell in each column, i.e., the entries in 
the main diagonal are solved and in successive iterations, 
the diagonals above are solved. To compute the value for 
cell (/,;), the entries in the row to its left and in the column 
below (/,;) are needed. The entries in the same column 
are computed in earlier iterations. Similarly, the entries 
on the left are solved by other processes in earlier itera- 
tions. Since these values are computed in earlier iterations, 
each diagonal cell can be filled independent of the other 
processes. 

More formally, we have / g [n] parallel processes and the 
process computes the values of values along the col- 
umn 7. All processes synchronize after filling an entry; this 
ensures that values needed to fill a cell are computed by 
the other processes. 

The parallel algorithm for process j for / = 1, 2, . . . , 

1: ^0,D(;,;) ^0 

2: for / = j down to 1 do 

3: D{Uj) ^ D(i + 1,; - 1) + b(S[i] , S\j] ) 

4: for /: = / + 1 to / do 

5: D(iJ) ^ max{D(/,7), D(i, k - 1) + D(kJ)} 

6: end for 

7: Synchronize with other processes 
8: end for 

A process has to compute the value for 0(n) cells and for 
each cell it needs to access 0(n) other cells. Thus the total 
computation takes 0(n^) time with n processes. 

We will describe the use the two-vector Four-Russians 
method to obtain an 0(n^/logn) algorithm below. The 
preprocessing step that enumerates the solution for 2^ x 
2^ difference vectors is embarrassingly parallel and we do 
not discuss the parallel algorithm for it. 

As before, we fill the table along the diagonals. We use 
n processes, one for each column. Each process solves the 
entries of the column from bottom to top. As in the serial 
algorithm, computing the maximum by looping over all 
the entries is the expensive part of the computation and 
will be optimized. Instead of looping over the individual 
entries (lines 4 - 6 in the parallel algorithm above), we use 
the Four-Russians technique to solve q cells in one step 
by looking up the table computed in the preprocessing 
step. 

Let dniiyj) be the horizontal difference vector for 
cells D{Uj), . . . ,D{i -\- q — 1,/) and let dv(i>J) be the 
vertical difference for cells D{Uj)f . . . ,D{i + q — 1,/). 
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We modify the inner loop of the parallel algorithm as 
follows: 

1: for/c' = Oto [j/qi - 1 do 
2: k = i-\-k' 

3: D{Uj) = max{D(/,;),D(/,/:) + D(k + 1,;) + 

R[dH(i.k)] [dv(k + h;)]} 
4: end for 

5: for k = IJ/qj X qtoj do 

6: D(iJ) <- max{D(/,7), D(i, k) + D{k + 1,/)} 

7: end for 

8: Compute the horizontal and vertical differences and 
store them in dnii — q + 1 j) and dyiij) respectively. 

For each entry, the first loop takes 0(n/q) time and the 
second loop takes 0(q) time. Since all the processes are 
solving the k^^ diagonal in the k^^ iteration, all of them 
execute the same number of steps before synchronization. 
Note that we compute the horizontal and vertical differ- 
ences for every node, unlike in Algorithm 1 where they 
are computed every q^^ cell, to ensure that every process 
performs the same number of steps and simplify the anal- 
ysis. The difference vectors can be computed in 0(q) time. 
These can also be computed in constant time by shifting 
the previous difference vector and appending the new dif- 
ference. But we will not assume this simplification for the 
time bound computation. 

Thus each entry can be computed in 0(n/q + q) time. 
There are 0(n) entries for each process, thus the total time 
taken for all processes to terminate is 0(n^/q + nq). With 
q = logn as before, this gives an OirP' /logn) algorithm. 

Parallel implementation 

GPU architecture 

Graphics processing units (GPUs) are specialized pro- 
cessors designed for computationally intensive real-time 
graphics rendering. In addition to manipulating graphics 
objects the parallel architecture can be exploited for other 
tasks where large amounts of data are to be processed in 
parallel. Compute Unified Device Architecture (CUDA) 
is the computing engine designed by NVIDIA for their 
GPUs. It allows the programmer to write highly parallel 
code and provides platform-specific optimizations. 

In CUDA, a serial program on a "host" CPU launches 
parallel "kernels" on the "device" GPUs. Kernels specify 
the code to be executed by all the threads. Every thread 
executes the same code in a kernel but can be assigned 
a different part of the task based on their indices. This 
paradigm is called Single-Instruction Multiple-Thread 
(SIMT), which is similar to Single-Program Multiple Data 
(SPMD) where the threads are run almost in lockstep. 

The programmer can group threads in a blocks which 
in turn can be organized in a grid hierarchy. The threads 



in a block and blocks in a grid can be organized in one-, 
two- or three-dimensions. While the hierarchy is specified 
when launching a kernel, thread management is handled 
by the underlying system. The threads within a block use 
barrier synchronization. Different blocks communicate by 
atomic memory operations in global memory. 

Memory hierarchy includes thread-specific local mem- 
ory, block-level shared memory for all threads in the block 
and global memory for the entire grid. The access times 
increases along the hierarchy from local to global memory. 

Kernels are very fast when threads run in lockstep. If 
certain threads take a conditional branch or are delayed by 
memory access then all the threads in the block are stalled. 
Since the access to global memory is slow (more clock 
cycles than local memory access), it is efficient for the 
threads within a block to access contiguous memory loca- 
tions. Then the hardware coalesces memory accesses for 
all threads in a block into one request. More specifically, 
in our application, if a matrix is stored in row-major order 
and if the threads in a block access contiguous elements of 
a row, then the accesses can be coalesced. Whereas access- 
ing elements along a column is inefficient as distant mem- 
ory elements have to be fetched from different cache lines. 

Programs that observe the hardware specifications can 
exploit the optimizations in the system and are fast in 
practice. We designed the program that exploits the par- 
allel structure of the DP algorithm and the hardware 
features of the GPU. 

Related work 

As mentioned earlier, the cells of a diagonal are inde- 
pendent of one another and can be computed in parallel. 
In Stojanovski et al [7], elements of the diagonal are 
assigned to a block of threads. This design does not handle 
memory coalescence for either row or column accesses. 
Chang et al. [5] allocate an n x n table and reflect the 
upper-triangular part of the matrix on the main diago- 
nal. Successive elements of a column are fetched from the 
row in the reflected part of the matrix. When threads of a 
block are assigned to elements of a diagonal, the successive 
column accesses for a thread are to consecutive mem- 
ory cells. However, this does not allow coalesced access 
for threads within a block. Rizk and Lavenier [6] show 
an implementation for RNA folding under energy mod- 
els. They show a tiling scheme where a group of cells are 
assigned to a block of threads to reuse the data values that 
are fetched from a column. In this paper, we show that 
storing the row and column vectors in different orders for 
two-vector method can further improve the efficiency. 

Design of the Four-Russians CUDA program 
The high-level Idea 

Like in the parallel algorithm described in Section 'An 

2 

0(j^^) parallel algorithm', we will fill the cells of the table 
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along the diagonals. However, for efficiency reasons, we 
cannot fill each diagonal in order. For example, by assign- 
ing different cells of a diagonal to a block of threads, each 
thread has to access a different row and different column. 
This design is not efficient as threads in a block try to 
access different locations of global memory which slows 
down the program. For the program to be efficient on the 
CUDA architecture, we exploit coalesced memory access 
for faster computation and ensure that individual threads 
are not stalled. We create a tiled table by grouping q x q 
cells into a tile and assign a block of threads to fill a tile. We 
will show that the tiles are independent and assign a grid 
of blocks to solve a diagonal of tiles in parallel. Thus the 
implementation can be thought of as a generalization of 
the Four-Russians algorithm on the tiled table. The details 
follow. 

Data structures 

Notice that cells D(iJ) and D(i — 1,;) both use values 
in column j below D(iJ), Similarly, D(iJ + 1) and D(iJ) 
refer to values in row / to the left of cell (/,/). By grouping 
these cells together and assigning this group to a block of 
threads the values along the rows and columns should be 
fetched once per block. We will group q x q cells together 
and store the values from the rows and columns in shared 
memory. As noted earlier, for the four- Russians method, it 
is convenient to group the rows kq -\- 1, . . . , (k -\- l)q and 
columns kq, . . . , (/c + l)q — 1, for k e {0, . . . , [n/qj }. We 
now have a tiled table, where each tile is a composite of 
q X q cells. The tiles along a diagonal can be computed 
independent of each other. Each tile is assigned to a block 
of threads and computed in parallel. After all the entries 
of the tile are computed, only the horizontal and vertical 
differences are stored. The horizontal and vertical differ- 
ence vectors are used by the Four-Russians technique for 
later computations. 

To fill a tile, the horizontal differences of all the tiles to 
the left and vertical differences from the tiles underneath 
are accessed. By storing these difference vectors together 
the memory accesses can be coalesced. That is, we store 
the horizontal differences of the rows in a tile together. 
Similarly, the vertical differences of the tile are grouped 
together. However, the horizontal and vertical differences 
of the tiles are stored in different order. The horizontal 
differences are stored in row-major order and the verti- 
cal differences are stored in column-major order to exploit 
coalesced memory access. 

When each thread retrieves one vector from a tile, the 
block of threads accesses contiguous memory locations 
and the memory accesses are coalesced. Successive itera- 
tions fetch the vectors from tiles along a row which are in 
contiguous memory locations. Similarly the vertical dif- 
ferences of a tile below are accessed in one coalesced 
memory access by the threads of the block. 



Since we group q elements together, at the last column 
of tiles might have fewer than q elements and handling the 
corner cases at the GPU will involve extra checks which 
might slow the program down. We can avoid these by 
padding the sequence with extra characters. The modified 
string has W = n -\- q — n mod q characters. The final 
result is still stored in cell £)(!,«). 

Thread hierarchy 

There are a few options for assigning threads to compute 
various cells in a tile. We can assign q^ threads, one per 
cell, or assign q threads, one per row or column. While it 
appears that q^ threads admit more parallelism, using q 
threads is more advantageous for a number of reasons. 

There is a limit on the number of threads that can be 
allocated on a CUDA device. Requesting q^ threads per 
kernel might limit the number of kernels that can be 
launched in parallel. At various synchronization points 
and in conditional branches (like updating a new maxi- 
mum value in a cell) all threads are stalled. Moreover, with 
q^ threads, most of the threads will remain idle for many 
iterations due to the dependency on other cells. Instead, 
the q threads can be organized to keep the threads active 
in most iterations and perform the same computations 
with fewer stalls. 

The algorithm 

Let N be the number of diagonals in the tiled table [N = 
{W + l)/q). The main algorithm run on the CPU is as 
follows. The various kernels are described next. 

1: Complete the preprocessing in the parallel. 

2: Launch N init_diag kernels to solve the first two 

diagonals. 
3: for / = 2 to / do 

4: Launch {N — i) simple_kernel kernels. 

5: end for 

6: for i = ItoN do 

7: Launch (p x (N — /)) parallel_kernel ker- 
nels. 

8: On completion of the previous step, launch N — i 

combine_kernels 
9: end for 

We will first describe simple_kernel. The other ker- 
nels have a similar structure and are described next. The 
thread solves the j column for all rows of the tile. It 
fetches the horizontal difference and initial cell value of 
the row of a tile to the left and stores it in shared mem- 
ory. The other threads in the block fetch from the other 
rows of the tile to shared memory. These reads occur 
in parallel and are coalesced as they refer to contiguous 
memory addresses. The 7* thread then fetches the ver- 
tical difference and initial value of the column of the 
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corresponding tile below. Since only the / thread needs 
this data, it stores it in local memory. 

The threads use these values and compute the values for 
the all the rows. This is repeated for all tiles. This corre- 
sponds to the loop in lines 16-20 of Algorithm 1. Finally 
to complete filling the tile, the rows are filled from bottom 
to top. At every row, line 5 and the loop in lines 10-12 are 
executed in parallel. Then the loop in lines 13-15 are exe- 
cuted. The threads cannot independently to complete this 
loop. The threads use synchronization to ensure that the 
values in the columns to the left are written before they are 
used. Finally, the thread computes the horizontal dif- 
ferences of the J row and the vertical differences of the 
column and store the values in the corresponding memory 
locations. 

The parallel kernels 

For every diagonal element, the Four- Russians part of the 
code has to fetch the values from tiles below and to the 
left. Closer to the top of the matrix, there are more tiles 
to retrieve data from. Since these tiles are all independent, 
the respective computations can be parallelized. For the 
k^^ diagonal, we launch Ik/pj kernels per tile. Each kernel 
will iterate over p tiles and store the result in a temporary 
location. We will then launch another kernel which will 
iterate over the remaining tiles and then complete the rest 
of the steps like simple_kernel to compute the values 
of the individual cells of the tile and store the horizontal 
and vertical differences. 

Initial diagonals 

The fi^ block of the init_diag kernel will solve the 
(/, i)^^ diagonal tile and (/, / + l)^'^ tile. The values in the 



Table 1 Speedup factors of the serial programs on the 
desktop 





Time 




Speedup 






Length 


Nussinov 
(in sees) 


Two-vector 


Partially 
memoized 


Completely 
memoized 


FG 


2000 


16.5 


7.7 


7.3 


5.6 


3.0 


3000 


62.5 


8.8 


8.3 


6.4 


3.4 


4000 


196.6 


11.9 


11.4 


8.8 


4.7 


5000 


630.3 


21.1 


18.9 


14.7 


7.8 


6150 


1027.8 


18.1 


17.0 


13.3 


7.03 



(/ + 1, / + 1)^^ diagonal tile are also needed to solve the 
(/, / + 1) the tile. Instead of waiting for a different kernel to 
solve it and then read it from global memory, those values 
are also solved by the fi^ block locally but the results are 
not stored. The (/ + l)^'^ block also computes these values 
and stores the horizontal and vertical differences. 

The values for these tiles are computed by standard 
Nussinov algorithm. To solve the (/, / + l)^'^ tile, the 
thread solves the diagonal like in simple_kernel. 

Empirical results 

Prior to empirical evaluation, the FG algorithm was 
expected to be the slowest due to the repeated computa- 
tion. The memoized versions were expected to be faster 
than the two-vector algorithm, as they preprocess only a 
subset of the 2^^ vectors seen in the table. 

We ran the programs on complete mouse non-coding 
RNA sequences. We also tested the performance on ran- 
dom substrings on real RNA sequences and random 
strings over A , C , G , U. 
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The FG algorithm, while faster than Nussinov, was the 
slowest among the Four-Russians methods, as expected. 
The completely memoized version was slower than the 
other two variants. This is because every lookup of the 
preprocessing table includes a check to see if the pair of 
vectors has already been processed. There are 2^^ unique 

3 

vector pairs but there are 0(y ) queries to the preprocess- 
ing table and each query involves checking if the vector 
pair has been processed plus the processing time for new 

2 

pairs. There are 0(\) vector pairs in the table. For larger 

n (e.g., n > 1000 and q = 8), all the 2^^ vectors are 
expected to be present in the DP table. Generally, memo- 
ized subproblems are relatively expensive compared to the 
lookup. Since the preprocessing here has only q steps, the 
advantage of memoization is not seen. 

The partially memoized version was slightly slower than 
the two vector algorithm. Again, the advantage of poten- 
tially less preprocessing than the two-vector method is 
erased by the need to check if a vector has been processed. 
The two-vector method was the fastest on all sequence 
lengths tested. 

For short sequences the two vector method took negli- 
gible time (less than 0.2 seconds up to 1000 bases) and are 
not reported. For longer sequences, we noticed that using 
longer vector lengths reduced the running time. How- 
ever the improvement saturated at ^ = 8 or 9 (Figure 2). 
Beyond this, the extra work in preprocessing overshad- 
owed the benefit. A similar trend was seen for the memo- 
ized versions too. However, for the FG method q = 3 gave 
the best speedup and longer vector lengths had a slower 
running time due to the extra preprocessing at every 
group. 



The algorithms implemented compute the same match- 
ings as the Nussinov algorithm. The correctness of each 
implementation was evaluated by comparing the entried 
of the DP table of the Nussinov algorithm to the analogous 
values computed by the faster implementation. All the 
programs were written in C++ compiled with the highest 
compiler optimizations. We only discuss the experimental 
results on a desktop and two GPU cards in this paper. 

We measured the running times of the different ver- 
sions of our serial algorithms on a desktop machine with a 
Pentium II 3 GHz processor and 1 MB cache. The running 
times of Nussinov and the speedups of various programs 
compared to Nussinov are shown in Table 1. 

The times reported are an average over 10 sequences 
of approximately the same lengths. Among the serial pro- 
grams tested, FG had the slowest running times and two- 
vector method had the best running times. For sequences 
of length 6000, the two-vector method takes close to a 
minute on the desktop. As discussed earlier, the extra 
steps to check if a vector or a pair of vectors have already 
been processed takes longer than the benefit of potentially 
fewer steps needed for preprocessing. 



Table 2 Running times for the parallel program (in sees) 



Length 


On GeForce 


On Tesia 


2000 


0.20 


0.14 


3000 


0.62 


0.38 


4000 


1.36 


0.74 


5000 


2.70 


1.39 


6000 


4.97 


2.50 
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On sequences of length 5000 bases, two vector had a 
20 times speedup, and FG has a 7 times speedup over the 
Nussinov program on a desktop machine with a Pentium 
II 3 GhZ processor and 1 MB cache. On a server with 
four 64-bit cores of Xeon 2.8 GhZ machines with 8 MB 
primary cache, the running times of all the programs were 
faster, but the relative speedup on Nussinov was more 
drastic than that of other programs. The same relative 
trends for the Four-Russians programs were seen — two- 
vector method was faster than the partially memoized 
problem; the totally memoized version, while faster than 
the FG algorithm was slower than the other two vari- 
ants. Figure 2 shows the running times of the two-vector 
method on the desktop and servers, respectively. 

Figure 3 shows the execution times on two GPU cards - 
GeForce GTX 550 Ti card with 1 GB on-card memory and 
Tesla C2070 with 5 GB memory. The programs take about 
a second for sequences up to 4000 bases long, and takes 
about 5 seconds and 2.5 seconds for sequences of length 
6000. The running times for various sequence lengths are 
shown in Table 2. Even in the parallel implementation, we 
see the same trend with increasing the vector lengths — 
there is a marked decrease in running time by increas- 
ing the vector lengths from ^ = 3, and the improvement 
saturates around q = 8 (Figure 3). 

Details on running times of the other variants can be 
found in the technical report [29]. 

Conclusions 

We described the two-vector method for using the Four- 
Russians technique for RNA folding. This method is 
simpler than the Frid-Gusfield method. It also improves 

the bound of the parallel algorithm by a logn factor to 

2 

0(r4— ). We showed two other variants that memoize 

^ log n ^ 

the preprocessing results. These methods are faster than 
Nussinov by up to a factor of 20 and the Frid-Gusfield 
method by a factor of 3. 

In the future, it will be interesting to see the applica- 
tion of the Four-Russians technique for other methods 
that use energy models with thermodynamic parame- 
ters. The Frid-Gusfield method has been applied to RNA 
co-folding [30] and folding with pseudoknots [31] prob- 
lems; the application of the two-vector method to those 
problems and its implications are also of interest. It will 
be interesting to compare our run time with the other 
improvements over Nussinov, like the boolean matrix 
multiplication method [1]. 
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